Abstract. We introduce Quasi-Threshold Mover (QTM), an algorithm 
to solve the quasi-threshold (also called trivially perfect) graph editing 
problem with edge insertion and deletion. Given a graph it computes a 
quasi-threshold graph which is close in terms of edit count. This edit 
problem is NP-hard. We present an extensive experimental study, in 
which we show that QTM is the first algorithm that is able to scale to 
large real-world graphs in practice. As a side result we further present a 
simple linear-time algorithm for the quasi-threshold recognition problem. 


1 Introduction 


Quasi-threshold graphs, also known as trivially perfect graphs, are defined as
the P4- and C4-free graphs, i.e., the graphs that do not contain a path or
cycle on four nodes as node-induced subgraph [21]. They can also be
characterized as the transitive closure of rooted forests [20], as illustrated
in Figure 1. These forests can be seen as skeletons of quasi-threshold graphs.
Further, a constructive characterization exists: quasi-threshold graphs are the
graphs that are closed under disjoint union and the addition of isolated nodes
and nodes connected to every existing node [21].

Fig. 1: Quasi-threshold graph with thick skeleton, grey root and dashed
transitive closure.

Linear time quasi-threshold recognition algorithms were proposed in [21] and,
based on lexicographic breadth first search, in a later work (see Section 3).
Both construct a skeleton if the graph is a quasi-threshold graph. Further, the
latter also finds a C4 or P4 if the graph is not a quasi-threshold graph.

Nastos and Gao [15] observed that components of quasi-threshold graphs have
many features in common with the informally defined notion of communities in
social networks. They propose to find a quasi-threshold graph that is close to
a given graph in terms of edge edit distance in order to detect the communities
of that graph. Motivated by their insights we study the quasi-threshold graph
editing problem in this paper. Given a graph $G = (V, E)$ we want to find a
quasi-threshold graph $G' = (V, E')$ which is closest to $G$, i.e., we want to
minimize the number $k$ of edges in the symmetric difference of $E$ and $E'$.
Figure 2 illustrates an edit example. Unfortunately, the quasi-threshold graph
editing problem is NP-hard [15].

Fig. 2: Edit example with solid input edges, dashed inserted edges, a crossed
deleted edge, a thick skeleton with grey root.

However, the problem is fixed parameter tractable (FPT) in $k$ as it is defined
using forbidden subgraphs. A basic bounded search tree algorithm which tries
each of the 6 possible edits of a forbidden subgraph has a running time in
$O(6^k \cdot (|V| + |E|))$. A polynomial kernel of size $O(k^7)$ has also been
introduced. Unfortunately, our experiments show that real-world social networks
require a prohibitively large number of edits. We prove lower bounds on
real-world graphs for $k$ on the scale of $10^4$ and $10^5$. A purely FPT-based
algorithm with parameter $k$ can thus not scale in practice. The only heuristic
we are aware of was introduced by Nastos and Gao [15], but it examines all
$\Theta(|V|^2)$ possible edits in each greedy editing step and thus needs time
$\Omega(k \cdot |V|^2)$. Even though this running time is polynomial it is
still prohibitive for large graphs. In this paper we fill this gap by
introducing Quasi-Threshold Mover (QTM), the first scalable quasi-threshold
editing algorithm. The final aim of our research is to determine whether
quasi-threshold editing is a useful community detection algorithm. Designing an
algorithm capable of solving the quasi-threshold editing problem on large
real-world graphs is a first step in this direction.


1.1 Our Contribution 

Our main contribution is Quasi-Threshold Mover (QTM), a scalable
quasi-threshold editing algorithm. We provide an extensive experimental
evaluation on generated as well as a variety of real-world graphs. We further
propose a simplified certifying quasi-threshold recognition algorithm. QTM
works in two phases: an initial skeleton forest is constructed by a variant of
our recognition algorithm, and then refined by moving one node at a time to
reduce the number of edits required. The running time of the first phase is
dominated by the time needed to count the number of triangles per edge. The
best current triangle counting algorithms run in $O(|E| \alpha(G))$ time, where
$\alpha(G)$ is the arboricity. These algorithms are efficient and scalable in
practice on the considered graphs. One round of the second phase needs
$O(|V| + |E| \log \Delta)$ time, where $\Delta$ is the maximum degree. We show
that four rounds are enough to achieve good results.


1.2 Preliminaries 

We consider simple, undirected graphs $G = (V, E)$ with $n = |V|$ nodes and
$m = |E|$ edges. For $v \in V$ let $N(v)$ be the adjacent nodes of $v$. Let
$d(v) := |N(v)|$ for $v \in V$ be the degree of $v$ and $\Delta$ the maximum
degree in $G$. Whenever we consider a skeleton forest, we denote by $p(u)$ the
parent of a node $u$.


2 Lower Bounds 


A lot of previous research has focused on FPT-based algorithms. To show that no
purely FPT-based algorithm parameterized in the number of edits can solve the
problem in practice, we compute lower bounds on the number of edits required
for real-world graphs. The lower bounds we use are far from tight. However,
they are large enough to show that any algorithm with a running time
superpolynomial in $k$ cannot scale.

To edit a graph we must destroy all forbidden subgraphs H. For quasi-threshold
editing H is either a P4 or a C4. This leads to the following basic algorithm:
find a forbidden subgraph H, increase the lower bound, remove all nodes of H,
repeat. This is correct as at least one edit incident to H is necessary. If
multiple edits are needed then accounting for only one still yields a lower
bound. We can optimize this algorithm by observing that not all nodes of H have
to be removed. If H is a P4 with the structure A - B - C - D it is enough to
remove the two central nodes B and C. If H is a C4 with nodes A, B, C, and D
then it is enough to remove two adjacent nodes. Denote by B and C the removed
nodes. This optimization is correct if at least one edit incident to B or C is
needed. Regardless of whether H is a P4 or a C4, the only edit not incident to
B or C is inserting or deleting {A, D}. However, this edit only transforms a P4
into a C4 or vice versa. A subsequent edit incident to B or C is thus necessary.

H can be found using the recognition algorithm. However, the resulting running
time of $O(k(n + m))$ does not scale to large graphs. In the appendix we
describe a running time optimization to accelerate computations.


3 Linear Recognition and Initial Editing 


The first linear time recognition algorithm for quasi-threshold graphs was
proposed in [21]. Later, a linear time certifying recognition algorithm based
on lexicographic breadth first search was presented. However, as its authors
note, sorted node partitions and linked lists are needed, which result in large
constants behind the big-O. We simplify their algorithm to only require arrays
but still provide negative and positive certificates. Further we only need to
sort the nodes once to iterate over them by decreasing degree. Our algorithm
constructs the forest skeleton of a graph G. If it succeeds, G is a
quasi-threshold graph and the algorithm outputs for each node v a parent node
p(v). If it fails, it outputs a forbidden subgraph H.

To simplify our algorithm we start by adding a super node r to G that is
connected to every node and obtain G'. G is a quasi-threshold graph if and only
if G' is one. As G' is connected its skeleton is a tree. A core observation is
that higher nodes in the tree cannot have lower degrees, i.e.,
$d(v) \le d(p(v))$. We therefore know that r must be the root of the tree.
Initially we set p(u) = r for every node u. We process all remaining nodes
ordered decreasingly by degree. Once a node is processed its position in the
tree is fixed. Denote by u the node that should be processed next. We iterate
over all non-processed neighbors v of u and check whether p(u) = p(v) holds and
afterwards set p(v) to u. If p(u) = p(v) never fails then G is a
quasi-threshold graph as for every node x (except r) we have by construction
that the neighborhood of x is a subset of the one of p(x). If $p(u) \ne p(v)$
holds at some point then a forbidden subgraph H exists. Either p(u) or p(v) was
processed first. Assume without loss of generality that it was p(v). We know
that no edge {v, p(u)} can exist because otherwise p(u) would have assigned
itself as parent of v when it was processed. Further we know that p(u)'s degree
cannot be smaller than u's degree as p(u) was processed before u. As v is a
neighbor of u but not of p(u), we know that another node x must exist that is a
neighbor of p(u) but not of u, i.e., {u, x} does not exist. The subgraph H
induced by the 4-chain v - u - p(u) - x is thus a P4 or C4 depending on whether
the edge {v, x} exists. We have that $u \ne r$ as u is processed by the
algorithm and $v \ne r$ as its degree is at most d(u). Further $p(u) \ne r$ as
p(v) was processed before p(u) and $x \ne r$ as r is a neighbor of u. H
therefore does not use r and is contained in G.
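
The recognition procedure described above can be sketched in a few lines of
Python. This is only an illustrative sketch under our own assumptions: the
graph is given as a dictionary `adj` mapping each node to the set of its
neighbors, the virtual root r is represented by `None`, the function name
`recognize_quasi_threshold` is hypothetical, and the construction of the
forbidden subgraph H (the negative certificate) is omitted. It also uses
`sorted`, so it runs in O(n log n + m) rather than linear time.

```python
def recognize_quasi_threshold(adj):
    """Return a parent map of the skeleton forest, or None if G is not
    a quasi-threshold graph.

    adj maps every node to the set of its neighbors (simple, undirected).
    The virtual root r is represented by None, so None must not be a node.
    """
    parent = {u: None for u in adj}      # initially p(u) = r for every node
    processed = set()
    # Process nodes by decreasing degree. A bucket sort over the degrees
    # would be needed here to obtain a truly linear running time.
    for u in sorted(adj, key=lambda x: len(adj[x]), reverse=True):
        for v in adj[u]:
            if v in processed:
                continue
            if parent[v] != parent[u]:   # p(u) != p(v): a P4 or C4 exists
                return None
            parent[v] = u                # u becomes the new parent of v
        processed.add(u)
    return parent


star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}      # a P4
print(recognize_quasi_threshold(star))   # {0: None, 1: 0, 2: 0, 3: 0}
print(recognize_quasi_threshold(path))   # None
```

On the star, the center becomes the parent of all leaves; on the path, the
check fails when the second inner node meets its still unassigned leaf.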

From Recognition to Editing. We modify the recognition algorithm to construct
a skeleton for arbitrary graphs. This skeleton induces a quasi-threshold graph
Q. We want to minimize Q's distance to G. Note that all edits are performed
implicitly; for efficiency reasons we do not actually modify the input graph.
The only difference between our recognition and our editing algorithm is what
happens when we process a node u that has a non-processed neighbor v with
$p(u) \ne p(v)$. The recognition algorithm constructs a forbidden subgraph H,
while the editing algorithm tries to resolve the problem. We have three options
for resolving the problem: we ignore the edge {u, v}, we set p(v) to p(u), or
we set p(u) to p(v). The last option differs from the first two as it affects
all neighbors of u. The first two options amount to deciding whether we want to
make v a child of u even though $p(u) \ne p(v)$ or whether we want to ignore
this potential child. We start by determining a preliminary set of children by
deciding for each non-processed neighbor of u whether we want to keep or
discard it. These preliminary children elect a new parent by majority. We set
p(u) to this new parent. Changing u's parent can change which neighbors are
kept. We therefore reevaluate all the decisions and obtain a final set of
children for which we set u as parent. Then the algorithm simply continues with
the next node.

What remains to describe is when our algorithm keeps a potential child. It
does this using two edge measures: the number of triangles t(e) in which an
edge e participates and a pseudo-C4-P4-counter $p_c(e)$, which is the sum of
the number of C4 in which e participates and the number of P4 in which e
participates as central edge. Computing $p_c(\{x, y\})$ is easy given the
number of triangles and the degrees of x and y as
$p_c(\{x, y\}) = (d(x) - 1 - t(\{x, y\})) \cdot (d(y) - 1 - t(\{x, y\}))$
holds. Having a high $p_c(e)$ makes it likely that e should be deleted. We keep
a potential child only if two conditions hold. The first is based on triangles.
We know by construction that both u and v have many edges in G towards their
current ancestors. Keeping v is thus only useful if u and v share a large number
of ancestors as otherwise the number of induced edits is too high. Each common
ancestor of u and v results in a triangle involving the edge {u, v} in Q. Many of

    foreach v_m-neighbor u do
        push u;
    while queue not empty do
        u ← pop;
        determine child_close(u) by DFS;
        x ← max over score_max of reported u-children;
        y ← sum over child_close of close u-children;
        if u is v_m-neighbor then
            score_max(u) ← max{x, y} + 1;
        else
            score_max(u) ← max{x, y} - 1;
        if child_close(u) > 0 or score_max(u) > 0 then
            report u to p(u);
            push p(u);
    Best v_m-parent corresponds to score_max(r);

(a) Pseudo-code for moving v_m

Fig. 3: In Figure 3b the drawn edges are in the skeleton. Crossed edges are
removed while thick blue edges are inserted by moving v_m. a is not adopted
while b is.


these triangles should also be contained in G. We therefore count the triangles
of {u, v} in G and check whether there are at least as many triangles as v has
ancestors. The other condition uses $p_c(e)$. The decision whether we keep v is
in essence the question of whether {u, v} or {v, p(v)} should be in Q. We only
keep v if $p_c(\{u, v\})$ is not higher than $p_c(\{v, p(v)\})$. The details of
the algorithm can be found in the appendix. The time complexity of this
heuristic editing algorithm is dominated by the triangle counting algorithm as
the rest is linear.
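
To make the two edge measures concrete, the following sketch computes t(e) and
$p_c(e)$ for every edge of a small graph. It is an illustration under our own
assumptions: the dictionary-of-sets representation and the function name
`triangle_and_pc_counts` are hypothetical, and triangles are counted by plain
set intersection, which is simpler but slower than the $O(|E| \alpha(G))$
triangle counting algorithms mentioned in Section 1.1.

```python
def triangle_and_pc_counts(adj):
    """Return (t, pc): per-edge triangle counts and pseudo-C4-P4-counters.

    adj maps every node to the set of its neighbors. Edges are keyed by
    frozenset({u, v}) so that each undirected edge appears exactly once.
    """
    t, pc = {}, {}
    for u in adj:
        for v in adj[u]:
            e = frozenset((u, v))
            if e in t:                       # edge already handled from v's side
                continue
            t[e] = len(adj[u] & adj[v])      # common neighbors close triangles
            # Neighbors of u (resp. v) that are neither the other endpoint
            # nor a common neighbor can form a P4 or C4 around {u, v}.
            pc[e] = (len(adj[u]) - 1 - t[e]) * (len(adj[v]) - 1 - t[e])
    return t, pc


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}      # P4: 0 - 1 - 2 - 3
t, pc = triangle_and_pc_counts(path)
print(pc[frozenset((1, 2))])   # 1: {1, 2} is the central edge of the P4
print(pc[frozenset((0, 1))])   # 0
```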


4 The Quasi-Threshold Mover Algorithm 


Our algorithm iteratively increases the quality of a skeleton T using an
algorithm based on local moving. Local moving is a successful technique that is
employed in many heuristic community detection algorithms [12, 17]. As in most
algorithms based on this principle, our algorithm works in rounds. In each
round it iterates over all nodes v_m in random order and tries to move v_m. In
the context of community detection, a node is moved to a neighboring community
such that a certain objective function is increased. In our setting we want to
minimize the number of edits needed to transform the input graph G into the
quasi-threshold graph Q implicitly defined by T. We need to define the set of
allowed moves for v_m in our setting. Moving v_m consists of moving v_m to a
different position within T and is illustrated in Figure 3b. We need to choose
a new parent u for v_m. The new parent of v_m's old children is v_m's old
parent. Besides choosing the new parent u we select a set of children of u that
are adopted by v_m, i.e., their new parent becomes v_m. Among all allowed moves
for v_m we choose the move that reduces the number of edits as much as
possible. Doing this in sub-quadratic running time is difficult as v_m might be
moved anywhere in G. By only considering the neighbors of v_m in G and a few
more nodes per neighbor in a bottom-up scan in the skeleton, our algorithm has
a running time in $O(n + m \log \Delta)$ per round. While our algorithm is not
guaranteed to be optimal as a whole we can prove that for each node v_m we
choose a move that reduces the number of edits as much as possible. Our
experiments show that given the result of the initialization heuristic our
moving algorithm performs well in practice. They further show that in practice
four rounds are good enough which results in a near-linear total running time.
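
The objective that the moves try to reduce, the number of edits between G and
the graph Q induced by a skeleton, can be written down directly. The sketch
below is a naive reference computation under our own assumptions (hypothetical
function names, a parent dictionary with `None` for roots); it materializes all
edges of Q explicitly, which QTM itself avoids.

```python
def closure_edges(parent):
    """Edges of Q: every node is connected to each of its ancestors."""
    edges = set()
    for v in parent:
        a = parent[v]
        while a is not None:                 # walk up to the root
            edges.add(frozenset((v, a)))
            a = parent[a]
    return edges


def edit_count(adj, parent):
    """Size of the symmetric difference between E(G) and E(Q)."""
    g_edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    return len(g_edges ^ closure_edges(parent))


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}      # P4: 0 - 1 - 2 - 3
skeleton = {1: None, 0: 1, 2: 1, 3: 2}             # 1 is the root, 3 hangs below 2
trivial = {0: None, 1: None, 2: None, 3: None}     # every node is a root
print(edit_count(path, skeleton))   # 1: only the edge {1, 3} must be inserted
print(edit_count(path, trivial))    # 3: all three edges must be deleted
```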

Basic Idea. Our algorithm starts by isolating v_m, i.e., removing all incident
edges in Q. It then finds a position at which v_m should be inserted in T. If
v_m's original position was optimal then it will find this position again. For
simplicity we will assume again that we add a virtual root r that is connected
to all nodes. Isolating v_m thus means that we move v_m below the root r and do
not adopt any children. Choosing u as parent of v_m requires Q to contain edges
from all ancestors of u to v_m. Further, if v_m adopts a child w of u then Q
must have an edge from every descendant of w to v_m. How good a move is depends
on how many of these edges already exist in G and how many edges incident to
v_m in G are not covered. To simplify notation we will refer to the nodes
incident to v_m in G as v_m-neighbors. We start by identifying which children a
node should adopt. For this we define the child closeness child_close(u) of u
as the number of v_m-neighbors in the subtree of u minus the number of
non-v_m-neighbors. A node u is a close child if child_close(u) > 0. If v_m
chooses a node u as new parent then it should adopt all close children of u. A
node can only be a close child if it is a neighbor of v_m or if it has a close
child. Our algorithm starts by computing all close children and their closeness
using many short DFS searches in a bottom-up fashion. Knowing which nodes are
good children we can identify which nodes are good parents for v_m. A potential
parent must have a close child or must be a neighbor of v_m. Using the set of
close children we can easily derive a set of parent candidates and an optimal
selection of adopted children for every potential parent. We need to determine
the candidate with the fewest edits. We do this in a bottom-up fashion.

To implement the described moving algorithm we need to put $O(d_G(v_m))$
elements into a priority queue. The running time is thus amortized
$O(d_G(v_m) \log d_G(v_m))$ per move or $O(n + m \log \Delta)$ per round. We
start many small searches and analyze their running time complexity using
tokens. Initially only the v_m-neighbors have tokens. A search consumes a token
per step. The details of the analysis are complex and are described in the
appendix.

Close Children. To find all close children we attach to each node u a DFS
instance that explores the subtree of u. Note that every DFS instance has a
constant state size and thus the memory consumption is still linear. u is close
if this DFS finds more v_m-neighbors than non-v_m-neighbors. Unfortunately we
cannot fully run all these searches as this requires too much running time.
Therefore a DFS is aborted if it finds more non-v_m-neighbors than
v_m-neighbors. We exploit that close children are v_m-neighbors or have
themselves close children. Initially we fill a queue of potential close
children with the neighbors of v_m and when a new close child is found we add
its parent to the queue. Let u denote the current node removed from the queue.
We run u's DFS and if it explores the whole subtree then u is a close child. We
need to take special care that every node is visited by only one DFS. A DFS
therefore looks at the states of the DFS of the nodes it visits. If one of
these other DFS has run then it uses their state information to skip the
already explored part of the subtree. To avoid that a DFS is run after its
state was inspected we organize the queue as a priority queue ordered by tree
depth. If the DFS of u starts by first inspecting the wrong children then it
can get stuck because it would see the v_m-neighbors too late. The DFS must
therefore first visit the close children of u. To ensure that u knows which
children are close every close child must report itself to its parent when it
is detected. As all children have a greater depth they are detected before the
DFS of their parent starts.


Potential Parents. Suppose we consider the subtree T_u of u and w is a
potential parent in T_u. Consider the set of nodes X_w given by the ancestors
of w and the descendants of all close children of w. X_w includes w and its
close children themselves. Moving v_m below w requires us to insert an edge
from v_m to every non-v_m-neighbor in X_w. We therefore want to maximize the
number of v_m-neighbors minus the number of non-v_m-neighbors in X_w. This
value gives us a score for each potential parent in T_u. We denote by
score_max(u) the maximum score over all potential parents in T_u. Note that
score_max(u) is always at least -1 as we can move v_m below u and not adopt any
children. We determine in a bottom-up fashion all score_max(u) that are greater
than 0. Whether score_max(u) is -1 or 0 is irrelevant because isolating v_m is
never worse. The final solution will be in score_max(r) of the root r as its
"subtree" encompasses the whole graph. score_max(u) can be computed
recursively. If u is a best parent then the value of score_max(u) is the sum
over the closenesses of all of its close children ±1. If the subtree T_w of a
child w of u contains a best parent then score_max(u) = score_max(w) ± 1. The
±1 depends on whether u is a v_m-neighbor. Unfortunately not only potential
parents u have a score_max(u) > 0. However, we know that every node u with
score_max(u) > 0 is a v_m-neighbor or has a child w with score_max(w) > 0. We
can therefore process all score_max values in a similar bottom-up way using a
tree-depth ordered priority queue as we used to compute child_close. As both
bottom-up procedures have the same structure we can interweave them as an
optimization and use only a single queue. The algorithm is illustrated in
Figure 3a in pseudo-code form.
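
The definitions of child closeness and parent scores can be checked against a
naive reference that scans the whole skeleton for a single node v_m. The sketch
below is not the priority-queue algorithm of Figure 3a: it takes O(n) time per
move, evaluates every node as a parent candidate, and normalizes the score so
that isolating v_m scores 0. The function name `best_move_naive`, the
dictionary representation and the use of `None` for the virtual root r are our
own assumptions.

```python
def best_move_naive(adj, parent, vm):
    """Return (score, new_parent, adopted_children) for moving vm.

    adj: node -> set of neighbors in G. parent: node -> parent in the
    skeleton (None marks a child of the virtual root r). The score is the
    number of edits saved compared to isolating vm.
    """
    # Isolate vm: its former children are re-attached to its former parent.
    par = {u: (parent[vm] if p == vm else p)
           for u, p in parent.items() if u != vm}
    children = {u: [] for u in par}
    roots = []
    for u, p in par.items():
        (roots if p is None else children[p]).append(u)

    def gain(u):                       # +1 for a vm-neighbor, -1 otherwise
        return 1 if u in adj[vm] else -1

    order = list(roots)                # top-down order: parents before children
    for u in order:
        order.extend(children[u])

    close = {}                         # child_close(u) over the subtree of u
    for u in reversed(order):
        close[u] = gain(u) + sum(close[c] for c in children[u])

    def adopted(kids):                 # a best move adopts exactly the close children
        return [c for c in kids if close[c] > 0]

    # Candidate "virtual root": no real ancestors, adopt the close roots.
    best = (sum(close[c] for c in adopted(roots)), None, adopted(roots))
    anc = {}                           # neighbors minus non-neighbors on the root path
    for u in order:
        anc[u] = gain(u) + (anc[par[u]] if par[u] is not None else 0)
        kids = adopted(children[u])
        score = anc[u] + sum(close[c] for c in kids)
        if score > best[0]:
            best = (score, u, kids)
    return best


# Star with center 1 and leaves 0, 2; node 3 is adjacent to the center only.
adj = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}
parent = {1: None, 0: 1, 2: 1, 3: None}
print(best_move_naive(adj, parent, 3))   # (1, 1, [])
```

In this example the best move makes node 3 a child of the center without
adopting any children, which saves one edit compared to leaving it isolated.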


5 Experimental Evaluation 


We evaluated the QTM algorithm on the small instances used by Nastos and
Gao [15], on larger generated graphs and on large real-world social networks
and web graphs. We measured both the number of edits needed and the required
running time. For each graph we also report the lower bound b of necessary
edits that we obtained using our lower bound algorithm. We implemented the
algorithms in C++ using NetworKit [18]. All experiments were performed on an
Intel Core i7-2600K CPU with 32GB RAM. We ran all algorithms ten times with
ten different random node id permutations.


Comparison with Nastos and Gao's Results. Nastos and Gao [15] did not report
any running times; we therefore re-implemented their algorithm. Our
implementation of their algorithm has a complexity of
$O(m^2 + k \cdot n^2 \cdot m)$; the details can be found in the appendix.
Similar to their implementation we used a simple exact bounded search tree
(BST) algorithm for the last 10 edits. In Table 1 we report the minimum and
average number of edits over ten runs. Our implementation of their algorithm
never needs more edits than they reported.¹ Often our implementation needs
slightly fewer edits due to different tie-breaking rules.

For all but one graph QTM is at least as good as the algorithm of Nastos and
Gao in terms of edits. QTM needs only one more edit than Nastos and Gao for
the grass_web graph. The QTM algorithm is much faster than their algorithm: it
needs at most 2.5 milliseconds while the heuristic of Nastos and Gao needs up
to 6 seconds without bounded search tree and almost 17 seconds with bounded
search tree. QTM needs at most 5 iterations. As the last round only checks
whether we are finished, four iterations would be enough.


Large Graphs. For the results in Table 2 we used two Facebook graphs [19] and
five SNAP graphs [14] as social networks and four web graphs from the 10th
DIMACS Implementation Challenge. We evaluate two variants of QTM. The first is
the standard variant which starts with a non-trivial skeleton obtained by the
heuristic described in Section 3. The second variant starts with a trivial
skeleton where every node is a root. We chose these two variants to determine
which part of our algorithm has which influence on the final result. For the
standard variant we report the number of edits needed before any node is moved.
With a trivial skeleton this number is meaningless and thus we report the
number of edits after one round. All other measures are straightforward and
are explained in the table's caption.

Even though for some of the graphs the mover needs more than 20 iterations 
to terminate, the results do not change significantly compared to the results 
after round 4. In practice we can thus stop after 4 rounds without incurring a 
significant quality penalty. It is interesting to see that for the social networks the 
initialization algorithm sometimes produces a skeleton that induces more than m 
edits (e.g. in the case of the “Penn” graph) but still the results are always slightly 
better than with a trivial initial skeleton. This is even true when we do not abort 
moving after 4 rounds. For the web graphs, the non-trivial initial skeleton does 
not seem to be useful for some graphs. It is not only that the initial number of 
edits is much higher than the finally needed number of edits, also the number of 


¹ Except on Karate, where they report 20 due to a typo. They also need 21 edits.

Table 1: Comparison of QTM and [15]. We report n and m, the lower bound b,
the number of edits (as minimum, mean and standard deviation), the mean and
maximum of number of QTM iterations, and running times in ms.

Name        n    m    b   Algorithm         Edits           Iterations      Time [ms]
                                       min   mean    std    mean   max      mean      std
dolphins    62   159  24  QTM           72    74.1    1.1    2.7   4.0       0.6      0.1
                          NG w/ BST     73    74.7    0.9     -     -    15594.0   2019.0
                          NG w/o BST    73    74.8    0.8     -     -      301.3      4.0
football    115  613  52  QTM          251   254.3    2.7    3.5   4.0       2.5      0.4
                          NG w/ BST    255   255.0    0.0     -     -    16623.3   3640.6
                          NG w/o BST   255   255.0    0.0     -     -     6234.6     37.7
grass_web   86   113  10  QTM           35    35.2    0.4    2.0   2.0       0.5      0.1
                          NG w/ BST     34    34.6    0.5     -     -    13020.0   3909.8
                          NG w/o BST    38    38.0    0.0     -     -      184.6      1.2
karate      34   78   8   QTM           21    21.2    0.4    2.0   2.0       0.4      0.1
                          NG w/ BST     21    21.0    0.0     -     -     9676.6    607.4
                          NG w/o BST    21    21.0    0.0     -     -       28.1      0.3
lesmis      77   254  13  QTM           60    60.5    0.5    3.3   5.0       1.4      0.3
                          NG w/ BST     60    60.8    1.0     -     -    16919.1   3487.7
                          NG w/o BST    60    77.1   32.4     -     -      625.0    226.4


edits needed in the end is slightly higher than if a trivial initial skeleton
was used. This might be explained by the fact that we designed the
initialization algorithm with social networks in mind. Initial skeleton
heuristics built specifically for web graphs could perform better. While the
QTM algorithm needs to edit between approximately 50 and 80% of the edges of
the social networks, the edits of the web graphs are only between 10 and 25%
of the edges. This suggests that quasi-threshold graphs might be a good model
for web graphs while for social networks they represent only a core of the
graph that is hidden by a lot of noise. Concerning the running time one can
clearly see that QTM is scalable and suitable for large real-world networks.

As we cannot show for our real-world networks that the edit distance that we
get is close to the optimum, we generated graphs by generating quasi-threshold
graphs and applying random edits to these graphs. The details of the generation
process are described in the appendix. In Table 2 we report the results of two
of these graphs with 400 and 160 000 random edits. In both cases the number of
edits the QTM algorithm finds is below or equal to the generated editing
distance. If we start with a trivial skeleton, the resulting edit distance is
sometimes very high, as can be seen for the graph with 400 edits. This shows
that the initialization algorithm from Section 3 is necessary to achieve good
quality on graphs that need only few edits. As it seems to be beneficial for
most graphs and not very bad for the rest, we suggest using the initialization
algorithm for all graphs.

Table 2: Results for large real-world and generated graphs. Number of nodes
n and edges m, the lower bound b and the number of edits are reported in
thousands. Column "I" indicates whether we start with a trivial skeleton or
not. • indicates an initial skeleton as described in Section 3 and o indicates
a trivial skeleton. Edits and running time are reported for a maximum number of
0 (respectively 1 for a trivial initial skeleton), 4 and ∞ iterations. For the
latter, the number of actually needed iterations is reported as "It". Edits,
iterations and running time are the average over the ten runs.

Name         n [K]    m [K]   b [K]  I          Edits [K]           It           Time [s]
                                           0/1        4        ∞      ∞      0/1        4        ∞
Social networks
Caltech       0.77    16.66    0.35  •     15.8     11.6     11.6    8.5      0.0      0.0      0.1
                                     o     12.6     11.7     11.6    9.4      0.0      0.0      0.1
amazon         335      926    99.4  •      495      392      392    7.2      0.3      5.5      9.3
                                     o      433      403      403    8.9      1.3      4.9     10.7
dblp           317     1050    53.7  •      478      415      415    7.2      0.4      5.8      9.9
                                     o      444      424      423    9.0      1.4      5.2     11.5
Penn          41.6     1362    19.9  •     1499     1129     1127   14.4      0.6      4.2     13.5
                                     o     1174     1133     1129   16.2      1.0      3.7     14.4
youtube       1135     2988     139  •     2169     1961     1961    9.8      1.4     31.3     73.6
                                     o     2007     1983     1983   10.0      7.1     28.9     72.7
lj            3998    34681    1335  •    32451    25607    25577   18.8     23.5    241.9   1036.0
                                     o    26794    25803    25749   19.9     58.3    225.9   1101.3
orkut         3072   117185    1480  •   133086   103426   103278   24.2    115.2    866.4   4601.3
                                     o   106367   103786   103507   30.2    187.9    738.4   5538.5
Web graphs
cnr-2000       326     2739    48.7  •     1028      409      407   11.2      0.8     12.8     33.8
                                     o      502      410      409   10.7      3.2     11.8     30.8
in-2004       1383    13591     195  •     2700     1402     1401   11.0      7.9     72.4    182.3
                                     o     1909     1392     1389   13.5     16.6     65.0    217.6
eu-2005        863    16139     229  •     7613     3917     3906   13.7      6.9     90.7    287.7
                                     o     4690     3919     3910   14.5     22.6     85.6    303.5
uk-2002      18520   261787    2966  •    68969    31218    31178   19.1    200.6   1638.0   6875.5
                                     o    42193    31092    31042   22.3    399.8   1609.6   8651.8
Generated graphs
Gen. 160K      100      930      42  •      200      158      158    4.6      0.2      3.5      4.1
                                     o      193      158      158    6.1      1.0      3.3      4.9
Gen. 0.4K     1000    10649   0.391  •    1.161    0.395    0.395    3.0      3.3     43.8     43.8
                                     o      182     5.52     5.52    6.1     15.9     52.9     78.8

Case Study: Caltech. The main application of our work is community detection.
While a thorough experimental evaluation of its usefulness in this context is
future work we want to give a promising outlook. Figure 4 depicts the edited
Caltech university Facebook network from [19]. Nodes are students and edges
are Facebook friendships. The dormitories of most students are known. We
colored the graph according to this ground truth. The picture clearly shows
that our algorithm succeeds at identifying most of this structure.



6 Conclusion 

We have introduced Quasi-Threshold Mover (QTM), the first heuristic algorithm
to solve the quasi-threshold editing problem in practice for large graphs. As a
side result we have presented a simple certifying linear-time algorithm for the
quasi-threshold recognition problem. A variant of our recognition algorithm is
also used as initialization for the QTM algorithm. In an extensive experimental
study with large real-world networks we have shown that it scales very well in
practice. We generated graphs by applying random edits to quasi-threshold
graphs. QTM succeeds on these random graphs and often even finds other
quasi-threshold graphs that are closer to the edited graph than the original
quasi-threshold graph. A surprising result is that web graphs are much closer
to quasi-threshold graphs than social networks, for which quasi-threshold
graphs were introduced as community detection method. A logical next step is a
closer examination of the detected quasi-threshold graphs and the community
structure they induce. Further, our QTM algorithm might be adapted for the more
restricted problem of threshold editing, which is NP-hard as well.


Fig. 4: Edited Caltech network, 
edges colored by dormitories of 
endpoints. 


A Fast Computation of Lower Bounds 

As outlined in the main paper, the idea for computing lower bounds is to find a $C_4$ or $P_4$ and to remove two of its nodes (two neighboring nodes in the case of a $C_4$, the two central nodes in the case of a $P_4$) such that we destroy as few $C_4$ and $P_4$ as possible. Finding a single $P_4$ or $C_4$ is possible in linear time using the certifying recognition algorithm. The challenge when designing a fast algorithm for computing lower bounds is that the lower bound can be as large as $n/2$, so repeating the linear-time search for every removed pair results in a quadratic algorithm.

It is enough to identify the central edge of a $P_4$, as we only want to remove the two nodes that are incident to that edge anyway. For a $C_4$ it is likewise enough to identify any edge it is part of, as we again just want to remove the two incident nodes. Therefore it suffices to quickly find an edge that is part of a $C_4$ or that is the central edge of a $P_4$.

If we consider the neighbors of two adjacent nodes $u$ and $v$ and want to find a $C_4$ or a $P_4$ where $\{u, v\}$ is the central edge, we only need to find two nodes $x \in N(u) \setminus \{v\}$ and $y \in N(v) \setminus \{u\}$ such that $x \notin N(v)$ and $y \notin N(u)$. Common neighbors of $u$ and $v$ thus cannot be chosen for $x$ and $y$, but all other neighbors besides $u$ and $v$ can be chosen. Therefore we know that two such nodes exist whenever

$$p_c(\{u,v\}) = (d(u) - 1 - t(\{u,v\})) \cdot (d(v) - 1 - t(\{u,v\})) > 0,$$

where $t(\{u,v\})$ is the number of triangles that contain the edge $\{u,v\}$. The algorithm we choose is based on this observation. Initially, we count the triangles per edge for all edges. Then we iterate over all nodes and for each node $u$ we choose a neighbor $v$ such that $p_c(\{u,v\}) > 0$ and remove $u$ and $v$. After removing $u$ and $v$ we update the triangle counters accordingly.

In order not to destroy too many $P_4$ and $C_4$, we initially sort the nodes by degree in ascending order. We also choose the neighbor $v$ of $u$ such that the degree of $v$ is minimal. Note that the initial iteration order does not necessarily reflect the degree order anymore after some of the nodes have been removed.
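The greedy procedure described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used for the experiments: the name `greedy_lower_bound` and the adjacency-dictionary input are hypothetical, and for brevity the triangle count $t(\{u,v\})$ is recomputed on demand as the number of common neighbors instead of being maintained in counters.

```python
def greedy_lower_bound(adj):
    """Count greedily removed node pairs {u, v} with p_c({u, v}) > 0.

    adj maps each node to the set of its neighbors (hypothetical input format).
    """
    adj = {u: set(nb) for u, nb in adj.items()}  # work on a private copy

    def p_c(u, v):
        # t = number of triangles containing the edge {u, v}
        t = len(adj[u] & adj[v])
        return (len(adj[u]) - 1 - t) * (len(adj[v]) - 1 - t)

    def remove(u):
        # Delete u together with all its incident edges.
        for w in adj.pop(u):
            adj[w].discard(u)

    bound = 0
    # The iteration order is fixed by the initial degrees (ascending).
    for u in sorted(adj, key=lambda x: len(adj[x])):
        if u not in adj:
            continue  # u was already removed as partner of another node
        candidates = [v for v in adj[u] if p_c(u, v) > 0]
        if candidates:
            # Prefer the partner of minimal current degree.
            v = min(candidates, key=lambda x: len(adj[x]))
            remove(u)
            remove(v)
            bound += 1
    return bound
```

Recomputing common neighbors on demand keeps the sketch short; the counter-based variants discussed in the following paragraphs avoid this repeated work.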

Given a graph structure that allows removing a node $u$ in amortized time $O(d(u))$, the whole algorithm can actually be implemented in time $O(a(G)m)$ with $O(a(G)m)$ memory consumption. The running time $O(a(G)m)$ comes from triangle listing. The main idea is that we store for each edge the pairs of edges which form a triangle with it. Whenever we delete a node, we check for each incident edge and all its stored pairs whether the two other edges still exist and, if so, decrease their counters. As we delete each edge only once, this gives a total running time of $O(a(G)m)$.

However, in practice we found that the required amount of storage was too high for our compute servers. Even if we have "just" 40 triangles per edge on average (for example in the web graph "eu-2005" from the 10th DIMACS Implementation Challenge), storing these triangles means that we need a lot more memory than for just storing $G$.

Therefore we used the trivial update algorithm that, for deleting the edge $\{u,v\}$, enumerates all triangles the edge is part of and updates the counters accordingly. For deleting all edges this gives an $O(m \cdot \Delta)$ algorithm which only needs $O(m)$ memory. In practice this was still fast enough for the graphs we considered. In Table 3 we report the lower bound and the running time of the lower bound calculation for the large real-world graphs we considered (refer to the experimental evaluation for details concerning the graphs). The graphs are sorted by the number of edges $m$. As in the experimental evaluation, we executed all experiments ten times with different random node id permutations. Only for the largest graph, uk-2002, we used a single run with the original node ids for the lower bound calculation due to memory constraints. In this table we report the average and the maximum of both the bound and the running time, while in the experimental evaluation section we only reported the maximum. However, as one can see, the average and the maximum do not differ significantly. The running times clearly show that the running time does not only depend on $m$ but also on the degrees, i.e. graphs with a lower number of nodes but a comparable number of edges have a higher running time.
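The trivial counter update mentioned at the beginning of this paragraph can be sketched as follows. The helper names `edge_key`, `count_triangles` and `delete_edge` are hypothetical, node ids are assumed to be comparable (e.g. integers), and the graph is assumed to be given as a dictionary of neighbor sets.

```python
def edge_key(u, v):
    """Canonical dictionary key for the undirected edge {u, v}."""
    return (u, v) if u < v else (v, u)


def count_triangles(adj):
    """Return t({u, v}) for every edge as a dictionary keyed by edge_key."""
    return {
        (u, v): len(adj[u] & adj[v])
        for u in adj
        for v in adj[u]
        if u < v
    }


def delete_edge(adj, tri, u, v):
    """Delete {u, v} and decrement the counters of all affected edges."""
    # Every common neighbor w closes one triangle {u, v, w} that disappears.
    for w in adj[u] & adj[v]:
        tri[edge_key(u, w)] -= 1
        tri[edge_key(v, w)] -= 1
    adj[u].discard(v)
    adj[v].discard(u)
    del tri[edge_key(u, v)]
```

Only the per-edge counters are stored, which matches the $O(m)$ memory bound stated above; the price is that the triangles of an edge are enumerated again when it is deleted.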


Table 3: Results for the lower bounds of the large real-world graphs we considered

                                            Lower Bound           Time [s]
Name                  n            m         mean        max     mean    max
Caltech36           769       16 656        349.5        350      0.0    0.0
com-amazon      334 863      925 872     99 305.9     99 413      0.6    0.6
com-dblp        317 080    1 049 866     53 656.9     53 680      0.7    0.7
Penn94           41 554    1 362 229     19 918.7     19 920      1.8    1.8
cnr-2000        325 557    2 738 969     48 500.0     48 739     22.6   23.8
com-youtube   1 134 890    2 987 624    139 006.5    139 077      9.7    9.8
in-2004       1 382 908   13 591 473    194 849.9    195 206       70   70.4
eu-2005         862 664   16 138 468    228 457.1    228 759    187.2  188.1
com-lj        3 997 962   34 681 189  1 334 663.3  1 334 770     65.8   66.3
com-orkut     3 072 441  117 185 083  1 479 977.2  1 480 007    394.2  395.4
uk-2002      18 520 486  261 787 258            -  2 966 359        -  960.8

For uk-2002 only a single run was executed, so only one value is reported.


B Details of the Initialization Algorithm

In Algorithm 1 we provide the full initialization heuristic as pseudocode. Note that while we use $\le$ for the comparisons in the parent calculation, we use $<$ for the final selection of the neighbors to keep, in order not to wrongly assign too many neighbors to $u$.


Input: G = (V, E)
Output: Parent assignment p for each node

Sort V by degree in descending order using bucket sort;
p : V → V ∪ {∅}, u ↦ ∅;
Count triangles t({u, v});
foreach u ∈ V do
    // Process node u
    N ← {v ∈ N(u) | v not processed and p(u) = p(v) or
         (p_c({u, v}) ≤ p_c({v, p(v)}) and depth(v) ≤ t({u, v}) + 1)};
    p_n ← the most frequent value of p(x) for x ∈ N;
    if p_n ≠ p(u) then
        p(u) ← p_n;
        depth(u) ← 0;
        p_c({u, p_n}) ← ∞;
    foreach v ∈ N(u) that has not been processed do
        if p(u) = p(v) or (p_c({u, v}) < p_c({v, p(v)}) and
                depth(v) < t({u, v}) + 1) then
            p(v) ← u;
            depth(v) ← depth(v) + 1;

Algorithm 1: The Initialization Algorithm


C The Quasi-Threshold Mover in Detail

Here we want to describe the quasi-threshold mover algorithm in more detail. Apart from giving more details on how we actually implemented the algorithm in order to achieve the claimed running time, we will also give proofs for its correctness and running time.

The QTM algorithm iteratively modifies the forest that defines a quasi-threshold graph. In the following we assume again that our forest has a virtual root $r$ that is connected to all nodes in the original graph, i.e. we consider only the case of a tree.

For a single node $v_m$ the algorithm solves the following problem optimally:

Find a parent $u$ in the forest and a set of close children $C$ of that parent $u$ such that inserting $v_m$ as child of $u$ and moving $C$ to be children of $v_m$ minimizes the number of edits among all choices of $u$ and $C$.
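To make the objective concrete, the number of edits induced by a given forest can be computed by brute force as in the following sketch. It is meant as an illustration of the quantity that is minimized, not as part of QTM itself; the names `closure_edges` and `edit_count` are hypothetical, and roots are marked with `None` instead of using the virtual root $r$.

```python
def closure_edges(parent):
    """Edges of the quasi-threshold graph defined by a rooted forest.

    parent maps every node to its parent, or to None for a root.
    Each node is connected to all of its ancestors.
    """
    edges = set()
    for v in parent:
        a = parent[v]
        while a is not None:
            edges.add(frozenset((v, a)))
            a = parent[a]
    return edges


def edit_count(graph_edges, parent):
    """Number of edge insertions plus deletions between G and the closure."""
    g = {frozenset(e) for e in graph_edges}
    return len(g ^ closure_edges(parent))
```

For example, for the path 0-1-2-3 and the forest in which 1 is the root, 0 and 2 are children of 1, and 3 is a child of 2, the only edit is the insertion of the edge {1, 3}.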

One iteration of the algorithm consists of solving this problem for every node of the graph. We will show later that for a single node $v_m$ this is possible in $O(d(v_m)\log(d(v_m)))$ time amortized over an iteration, so the time for a whole iteration is in $O(n + m\log(\Delta))$. For $t > 0$ iterations the total running time is $O(t \cdot (n + m\log(\Delta)))$.
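The bound for a whole iteration follows from the per-node bound by summing over all nodes. As a short worked derivation, assuming that $\Delta$ denotes the maximum degree and charging constant extra work to every node (including isolated ones):

$$\sum_{v \in V} O\bigl(1 + d(v)\log(\Delta)\bigr) = O\Bigl(n + \log(\Delta)\sum_{v \in V} d(v)\Bigr) = O\bigl(n + 2m\log(\Delta)\bigr) = O\bigl(n + m\log(\Delta)\bigr).$$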

The main idea why this works is that we do not need to consider all possible parents but only those parents which are adjacent to $v_m$ or which have a close child, i.e. a child of which more than half of the descendants are adjacent to $v_m$. Otherwise the existing edges do not compensate for the missing edges and we could as well add $v_m$ as child of $r$, i.e. delete all edges in the original graph that are incident to $v_m$. We will show how to determine these possible parents and close children by visiting only a constant number of nodes for each neighbor of $v_m$; these nodes are determined by populating counters from the bottom to the top of the tree.

In one iteration, the QTM algorithm simply iterates over all nodes in a random order. For each node $v_m$ we search the optimal parent and children. Algorithm 2 contains the pseudocode for the main part of this search.
 1  // Assumption: v_m is not in the tree
 2  Insert neighbors of v_m in the queue;
 3  while queue is not empty do
 4      u ← pop node from queue;
 5      mark u as touched;
 6      if child_close(u) > score_max(u) then score_max(u) ← child_close(u);
 7      if u is marked as neighbor then
 8          child_close(u) ← child_close(u) + 2, score_max(u) ← score_max(u) + 2;
 9      child_close(u) ← child_close(u) − 1, score_max(u) ← score_max(u) − 1;
10      if child_close(u) ≥ 0 and u has children then    // Start a DFS from u
11          x ← first child of u;
12          while x ≠ u do
13              if x not touched or child_close(x) < 0 then
14                  child_close(u) ← child_close(u) − 1;
15                  x ← DFS(x);
16                  if child_close(u) < 0 then
17                      DFS(u) ← x;
18                      break;
19                  x ← next node in DFS order after x below u;
20              else
21                  x ← next node in DFS order after the subtree of x below u;
22      if u ≠ r then    // Propagate information to parent
23          if child_close(u) > 0 then
24              child_close(p(u)) ← child_close(p(u)) + child_close(u);
25              Insert p(u) in queue;
26          if score_max(u) > score_max(p(u)) then
27              score_max(p(u)) ← score_max(u);
28              Insert p(u) in queue;

Algorithm 2: Core algorithm of QTM: finding a new parent and children to be adopted.


In order to avoid complicated special cases we first remove $v_m$ from the tree (this is equivalent to isolating $v_m$ by inserting $v_m$ below the virtual root $r$). In the end we want to move $v_m$ back to its initial position if no better position was found. If the initial position of $v_m$ was the best position, then the algorithm will find it again. However, if there are multiple positions in the skeleton that induce the same number of edits, the algorithm will find any of these positions. In order to make sure that the algorithm terminates even if we do not limit the number of iterations, we store the initial position of $v_m$, i.e. its children and its parent. We also count the number of edits that were necessary among its neighbors. If no improvement was possible, we move the node back to this initial position in the end.
For a single node $v_m$ that shall possibly be moved, we will process its neighbors and possibly $O(d(v_m))$ other nodes ordered by decreasing depth. We maintain the list of these nodes in a priority queue that is initialized with the neighbors of $v_m$ and sorted by depth. As we do not want to dynamically determine the depth of a node, we calculate the depth initially and update it whenever we remove or insert a node in the forest. A marker is set for all neighbors of $v_m$ in order to make it possible to determine in constant time if a node is adjacent to $v_m$.
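A depth-ordered queue of this kind could be sketched as follows. This is an assumption-laden illustration rather than the authors' data structure: the class name `DepthQueue` is hypothetical, it is built on Python's `heapq` (a min-heap, hence the negated depth), and it relies on the property that a node is never re-inserted after it has been popped, because only parents of processed nodes are inserted.

```python
import heapq


class DepthQueue:
    """Priority queue that returns the deepest queued node first."""

    def __init__(self, depth):
        self._depth = depth    # mapping: node -> depth in the forest
        self._heap = []        # entries are (-depth, node)
        self._queued = set()   # nodes that have ever been inserted

    def insert(self, u):
        # Ignore duplicates so that every node is processed at most once.
        if u not in self._queued:
            self._queued.add(u)
            heapq.heappush(self._heap, (-self._depth[u], u))

    def pop(self):
        _, u = heapq.heappop(self._heap)
        return u

    def __bool__(self):
        return bool(self._heap)
```

Ties between nodes of equal depth are broken by node id here, which is one arbitrary but deterministic choice.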

When we process a node u of the queue, we first determine if u is the best 
parent in the subtree of u, then we possibly visit some nodes below u using a 
special DFS in order to determine the child closeness of u and possibly insert its 
parent into the queue. We will later explain the details of the DFS. 

We store the score of the best solution in the subtree of $u$ in $\text{score}_{\max}(u)$ and the child closeness of $u$ in $\text{child}_{\text{close}}(u)$. In order to avoid special cases we initialize $\text{score}_{\max}(u)$ with $-1$ and $\text{child}_{\text{close}}(u)$ with $0$. Furthermore we store at each node $u$ the state of the DFS that has possibly been started at $u$. To store this state it suffices to store the last visited node, which we keep in $\text{DFS}(u)$. We initialize $\text{DFS}(u)$ with $u$.

At the end, we can find the number of edits that can be saved over isolating $v_m$ in $\text{score}_{\max}(r)$, and we can additionally track which parent led to that score. As already mentioned, we compare this to the number of edits at the old position of $v_m$ and move $v_m$ back to the old position if no improvement was possible. If an improvement is possible, we insert $v_m$ below the parent $u$ that we identified as best parent. The missing part are the children that shall be moved from $u$ to $v_m$. We can determine them by visiting all previously visited nodes (we can store them) and checking for each visited node $c$ whether it is a close child of $u$, i.e. whether attaching it to $v_m$ would save edits, which we have stored in $\text{child}_{\text{close}}(c)$.

Proof of Correctness. In this section we want to give a formal proof why the local moving algorithm is correct, i.e. always selects the best parent and the best selection of children. We do this by giving exact definitions of all used variables and proving their correctness.

We begin with the child closeness $\text{child}_{\text{close}}(u)$, which is the number of edits that we can save if $v_m$ is attached below $u$.

Proposition 1. Either $\text{child}_{\text{close}}(u)$ is the number of neighbors of $v_m$ in the subtree of $u$ minus the number of non-neighbors, or there are more non-neighbors than neighbors in the subtree of $u$. In the latter case, if $u$ has been processed, then $\text{child}_{\text{close}}(u) = -1$. More precisely, if $u$ has been processed, $\text{child}_{\text{close}}(u)$ is the number of existing neighbors minus the number of missing neighbors of all nodes in DFS order between $u$ and $\text{DFS}(u)$ and additionally all subtrees of children $c$ with $\text{child}_{\text{close}}(c) \ge 0$ that are not in the DFS order between $u$ and $\text{DFS}(u)$.

Proof. We will give the proof by structural induction.

As a first step we want to establish that all nodes $u$ with $\text{child}_{\text{close}}(u) > 0$ are processed. As $\text{child}_{\text{close}}(u) \le 0$ if there are no neighbors of $v_m$ in the subtree of $u$, only neighbors of $v_m$ and their ancestors can have $\text{child}_{\text{close}}(u) > 0$. All neighbors of $v_m$ are processed, as they are inserted into the queue at the beginning. For non-neighbors $u$, $\text{child}_{\text{close}}(u) > 0$ means that one of their children $c$ is close, i.e. has $\text{child}_{\text{close}}(c) > 0$. As in this case $c$ inserts $u$ into the queue when it propagates its value, also in this case $u$ will be processed. As we process all nodes by descending depth (only parents, i.e. nodes of smaller depth, are inserted in the queue), we can assume that when we are at a node $u$, all descendants of $u$ that need to be processed have been processed, and that when the algorithm terminates all nodes $u$ with $\text{child}_{\text{close}}(u) > 0$ have been processed.

When $u$ is processed, $\text{child}_{\text{close}}(u)$ is first updated (increased by 2 if $u$ is a neighbor of $v_m$ and then decreased by 1) such that the proposition is true if we consider only $u$ itself. This means that the proposition is true for leaves, which is also the initial step of our induction.

As the claim is true for all children of $u$, we can also assume that all children $c$ with $\text{child}_{\text{close}}(c) > 0$ have already updated $\text{child}_{\text{close}}(u)$ accordingly, i.e. $\text{child}_{\text{close}}(u)$ already correctly considers $u$ and the values of all children with $\text{child}_{\text{close}}(c) > -1$.

If we have $\text{child}_{\text{close}}(u) = -1$ in line 10, $\text{child}_{\text{close}}(u)$ must have been $0$ initially, as up to this point it is decreased by at most 1. Therefore we are in the situation that $u$ is not a neighbor of $v_m$ and $u$ has no close children, i.e. children with $\text{child}_{\text{close}}(c) > 0$. In this situation $-1$ is already the final result: if this result was incorrect, i.e. the correct value was larger than $-1$, then there would have to be at least as many neighbors as non-neighbors of $v_m$ among the nodes in the subtree of $u$. As $u$ is not a neighbor of $v_m$, the descendants must then contain at least one more neighbor than there are non-neighbors among them, and this must also be true for at least one of the children of $u$. Therefore $\text{child}_{\text{close}}(c) > 0$ for this child, which is a contradiction to the situation that there are no children $c$ with $\text{child}_{\text{close}}(c) > 0$.


So now we only need to consider the case that $\text{child}_{\text{close}}(u) > -1$ in line 10, which means that the algorithm will start a DFS.

As a first step we want to have a closer look at the DFS that is executed in the algorithm. If we say in the following that the DFS "visits" a node, we mean that it is the value of $x$ in line 12.

On visiting certain nodes, we decrease $\text{child}_{\text{close}}(u)$. If $\text{child}_{\text{close}}(u) < 0$, we stop the DFS and store the last visited node in $\text{DFS}(u)$. This means that at the end of a DFS either $\text{child}_{\text{close}}(u) \ge 0$ or $\text{DFS}(u)$ points to the last visited node. Whenever we visit a node $c$, there are three possible cases:

1. The easiest case is that $c$ has been processed and $\text{child}_{\text{close}}(c) > -1$. In this case the edits of $c$ are already considered by the parent of $c$ and we do not need to deal with it. Furthermore, we know that $\text{child}_{\text{close}}(c)$ is the correct number of neighbors minus non-neighbors of the whole subtree of $c$, so it is correct that the algorithm skips these nodes by continuing after the subtree of $c$.

2. If $c$ has not been processed yet, it is not a neighbor of $v_m$ (otherwise it would have been processed). We decrease $\text{child}_{\text{close}}(u)$, which is correct as this is a missing neighbor. Then we can continue the DFS. The same is true if $c$ has been processed and $\text{child}_{\text{close}}(c) < 0$ but $\text{DFS}(c) = c$, i.e. no DFS has been executed.

3. If $\text{child}_{\text{close}}(c) < 0$ and $\text{DFS}(c) \ne c$, we know from the induction hypothesis that $\text{child}_{\text{close}}(c) = -1$ and that this is exactly the number of existing minus the number of missing neighbors from $c$ up to $\text{DFS}(c)$ in DFS order (including $\text{DFS}(c)$) plus the number of children $c'$ of $c$ with $\text{child}_{\text{close}}(c') > -1$, which we ignore anyway in the DFS. We decrease $\text{child}_{\text{close}}(u)$, which correctly considers the nodes between $c$ and $\text{DFS}(c)$ in DFS order (both included). Then we jump to $\text{DFS}(c)$ and do not visit $\text{DFS}(c)$ but the next node in DFS order, which is correct as $\text{DFS}(c)$ has already been considered by decreasing $\text{child}_{\text{close}}(u)$.

When the DFS ends, either we have now considered all edits of the descendants of $u$ or the DFS ended with $\text{child}_{\text{close}}(u) < 0$ and we have stored the location of the last visited node in $\text{DFS}(u)$. In the latter case, all nodes up to this point have been considered as we have outlined before. Therefore the claim is now also true for $u$. □

If we want to know for a potential parent $u$ how many edits we can save by moving some of its children to $v_m$, this is the sum of $\text{child}_{\text{close}}(c)$ for all close children $c$ of $u$, i.e. children with $\text{child}_{\text{close}}(c) > 0$. This is the value that is stored in $\text{child}_{\text{close}}(u)$ before $u$ is processed, because every close child $c$ adds $\text{child}_{\text{close}}(c)$ to $\text{child}_{\text{close}}(p(c))$. Obviously, this value is positive if a node $u$ has at least one close child.

In order to avoid evaluating all nodes as potential parents, we make use of the following observation:

Proposition 2. Only nodes with close children and neighbors of $v_m$ need to be considered as parents of $v_m$.

Proof. Assume otherwise: The best parent $u$ has no close children and is not a neighbor of $v_m$. Then attaching children of $u$ to $v_m$ makes no sense, as this would only increase the number of needed edits, so we can assume that no children will be attached. However, then choosing $p(u)$ as parent of $v_m$ will save one edit, as $u$ is not a neighbor of $v_m$. This is a contradiction to the assumption that $u$ is the best parent. □

So far we have only evaluated edits below nodes and identified all possible 
parents which are also processed as we have established before. The part that is 
still missing is the evaluation of the edits above a potential parent u. 

Theorem 3. Consider the subtree $T_u$ of $u$. Then for the subgraph of $T_u$, $\text{score}_{\max}(u)$ stores the maximum number of edits from $v_m$ to nodes inside the subgraph of $T_u$ that can be saved by choosing the parent of $v_m$ in $T_u$ instead of isolating $v_m$, or $-1$ if no edits can be saved.

Proof. The proof is given by structural induction on the tree skeleton. We start with the initial step, which is a node $u$ that is a leaf of the tree.

As $T_u$ only consists of $u$, we have only one possible edge from $v_m$ to $u$ and therefore only two cases: $u$ is a neighbor of $v_m$ or not. In both cases, $\text{score}_{\max}(u)$ is initialized with $-1$ as there are no children that could propagate any values. In the second case, $u$ will not be processed but the result is already correct anyway: no edits can be saved by choosing $u$ as parent of $v_m$. In the first case, as $\text{child}_{\text{close}}(u)$ is initialized to $0$, we have $\text{child}_{\text{close}}(u) > \text{score}_{\max}(u)$ and therefore $\text{score}_{\max}(u)$ is set to $0$. We end with $\text{score}_{\max}(u) = 1$, which is correct: we can save an edit over isolating $v_m$, as the edge $\{u, v_m\}$ does not need to be deleted when we choose $u$ as parent of $v_m$. When we set $\text{score}_{\max}(u)$ we can also store $u$ as best parent together with $\text{score}_{\max}(u)$.

Now we can assume that the theorem holds for all children of $u$.

As not all nodes are processed, we need to explain why $u$ is processed at all if $\text{score}_{\max}(u) > -1$ should hold. There are two possibilities: Either it could make sense to use $u$ as parent or we could use a node below $u$ as parent. In the first case, by Proposition 2, either $u$ is a neighbor of $v_m$ or $u$ has a close child, which means that $u$ is processed. Assume that in the second case $u$ was not processed although $\text{score}_{\max}(u) > -1$ should hold. Further we can assume that $u \notin N(v_m)$, as otherwise $u$ would have been processed. Let $x$ be the best parent in $T_u$ and let $c$ be the direct child of $u$ such that $x$ is in the subtree of $c$ (it is possible that $x = c$). If it makes sense to use $x$ as parent of $v_m$, then by inserting $v_m$ below $x$ also the edge $\{u, v_m\}$ must be inserted. This means that in the subtree of $c$ we can save one more edit, as the edit $\{u, v_m\}$ is not necessary, which means that $\text{score}_{\max}(c) = \text{score}_{\max}(u) + 1 > 0$. This means that by induction $c$ must have been processed, and $c$ must have propagated $\text{score}_{\max}(c)$ to $u$ and also inserted $u$ in the queue, which is a contradiction to the assumption that $u$ is not processed.

When $u$ is processed, we need to decide whether $u$ is the best parent in $T_u$ or whether we should choose a parent below $u$. Whether the edit of the edge $\{v_m, u\}$ is needed or not is independent of the choice of the parent in $T_u$. Therefore we do not need to reconsider any decisions that were made below $u$. If we do not choose $u$, then we need to choose the best parent below $u$, i.e. the one of the subtree of the child $c$ with the highest value of $\text{score}_{\max}(c)$. This is also what the algorithm does by propagating $\text{score}_{\max}(c)$ to the parent as the maximum of $\text{score}_{\max}(c)$ and $\text{score}_{\max}(p(c))$. Therefore $\text{score}_{\max}(u)$ is initialized to the best solution below $u$.

If we want to determine how good $u$ is as parent, we need to look at the closeness of its children. More specifically, we can save as many edits as the sum of $\text{child}_{\text{close}}(c)$ for all close children $c$ of $u$. This is the value to which $\text{child}_{\text{close}}(u)$ is initialized by its close children, therefore we only need to compare $\text{score}_{\max}(u)$ to $\text{child}_{\text{close}}(u)$. Therefore it is correct to set $\text{score}_{\max}(u)$ to $\text{child}_{\text{close}}(u)$ if $\text{child}_{\text{close}}(u)$ is larger than $\text{score}_{\max}(u)$. After this initial decision, we increase or decrease $\text{score}_{\max}(u)$ by one depending on whether the edge $\{u, v_m\}$ exists or not, which accounts exactly for this edge. □

As $T_r$ is the whole graph, $\text{score}_{\max}(r)$ determines the best solution of the whole graph. Therefore the QTM algorithm optimally solves the problem of finding a new parent and a set of its children that shall be adopted.


Proof of the Running Time. After showing the correctness of the algorithm, we will now show that the running time is indeed $O(m\log(\Delta))$ per iteration and amortized $O(d\log(d))$ per node.
During the whole algorithm we maintain a depth value for each node that specifies the depth in the forest at which the node is located. Whenever we move a node, we update these depth values. This involves decreasing the depth values of all descendants of the node in its original position and increasing the depth values of all descendants of the node at the new position. Unfortunately it is not obvious that this is possible in the claimed running time, as a node $v_m$ might have more than $O(d(v_m))$ descendants.

Note that a node is adjacent to all its descendants and ancestors in the edited graph. This means that every ancestor or descendant that is not adjacent to the node in the original graph causes an insertion. Therefore, after a move operation the node must be a neighbor of at least half of its ancestors and children: otherwise the deletions needed for isolating the node would be cheaper than the insertions needed at this position. This means that updating the depth values at the destination is possible in $O(d)$ time.

For the update of the values in the original position we need a different, more complicated argument. First of all we assume that initially the total number of edits does not exceed the number of edges, as otherwise we could simply delete all edges and get fewer edits. To amortize the cost for nodes that have more descendants and ancestors than their degree, we give each node tokens for all its neighbors in the edited graph. As the number of edits is at most $m$, the number of initially distributed tokens is in $O(m)$. Whenever we move a node $v_m$, it generates tokens for all its new neighbors and itself, i.e. in total at most $2 \cdot d(v_m)$ tokens. Therefore a node always has a token for each of its ancestors and descendants and can use that token to account for updating the depth of its previous descendants. In each round only $O(m)$ tokens are generated, therefore updating the depth values of a node is in amortized time $O(d)$ per node and $O(m)$ per iteration.

Using the same argument we can also account for the time that is needed for 
updating the pointers of each node to its parents and children and for counting 
the number of initially needed or saved edits. 

What we have shown so far means that once we know the best destination we can move a node and update all depth values in time $O(d)$ amortized over an iteration where all nodes are moved.

The remaining claim is that we can determine the new parent and the new children in time $O(d\log(d))$ per node. More precisely, we will show that only $O(d)$ nodes are inserted in the queue and that we need amortized constant time for processing a node. A standard max-heap that needs $O(\log(n))$ time per operation can be used for the implementation of the queue.

All values that are stored per node need to be initialized for the whole it¬ 
eration. All nodes whose values are changed, which are exactly the nodes that 
have been in the queue at some moment, need to be stored so their values can 
be reset at the end of the processing of a node. 

The basic idea of the main proof is that each neighbor of $v_m$ gets four tokens. This is represented by the fact that we increase $\text{child}_{\text{close}}(u)$ and $\text{score}_{\max}(u)$ by 2 for all neighbors $u$ of $v_m$. When we process a node $u$, one token is consumed if this node is not a neighbor of $v_m$, then the DFS consumes tokens of $\text{child}_{\text{close}}(u)$, and at the end the rest of the tokens is passed to the parent.

Note that all nodes that are processed have $\text{child}_{\text{close}}(u) > 0$ or $\text{score}_{\max}(u) > 0$, either initially or after accounting for the fact that they are neighbors of $v_m$.

First of all let us only consider processed nodes $u$ with $\text{child}_{\text{close}}(u) > 0$, either initially or after accounting for the fact that $u$ is a neighbor of $v_m$. We consume one token for processing this node. This pays for the whole processing of the node apart from the DFS, where the accounting is more complicated. Apart from the DFS only constant work is done per node, so consuming one token is enough for that.

Note that for each visited node in the DFS only a constant amount of work is needed, as traversing the tree, i.e. possibly traversing a node multiple times, can be accounted to the first visit. Without keeping a stack, this needs a tree structure in which the next child $c'$ after a child $c$ of a node $u$ can be determined in constant time. This can be implemented by storing in node $c$ the position of $c$ in the array (or list) of children of $p(c)$. This also allows deleting entries in the children list in constant time (in an array, deletion can be implemented as a swap with the last child).
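The child arrays with stored positions can be sketched as follows. The class `Forest` and its method names are hypothetical, and `None` is used here as the key of the virtual root; the point of the sketch is only the constant-time sibling lookup and the swap-with-last deletion described above.

```python
class Forest:
    """Rooted forest with O(1) next-sibling lookup and O(1) child removal."""

    def __init__(self):
        self.parent = {}
        self.children = {None: []}  # None plays the role of the virtual root
        self.pos = {}               # pos[c] = index of c in children[parent[c]]

    def attach(self, c, p=None):
        """Append c to the child array of p (p must already be in the forest)."""
        self.children.setdefault(c, [])
        self.parent[c] = p
        self.pos[c] = len(self.children[p])
        self.children[p].append(c)

    def detach(self, c):
        """Remove c from its parent's child array by swapping with the last child."""
        siblings = self.children[self.parent[c]]
        last = siblings[-1]
        i = self.pos[c]
        siblings[i] = last
        self.pos[last] = i
        siblings.pop()
        del self.pos[c]
        del self.parent[c]

    def next_sibling(self, c):
        """Return the child stored after c, or None if c is the last child."""
        siblings = self.children[self.parent[c]]
        i = self.pos[c] + 1
        return siblings[i] if i < len(siblings) else None
```

Note that the swap changes the order of the remaining children, which is harmless as long as no particular child order is required.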

Whenever we visit a node $x$ that has not been touched yet or that has $\text{child}_{\text{close}}(x) < 0$, we consume one token of $\text{child}_{\text{close}}(u)$. When this is not the case, i.e. $\text{child}_{\text{close}}(x) > -1$, the node has been processed already and we account our visiting of $x$ to the processing of $x$. This is okay as we visit each node only once during a DFS: After the DFS starting at $u$ has finished, either $\text{child}_{\text{close}}(u) > -1$ and an upcoming DFS will not descend into the subtree of $u$ anymore, or we ended the DFS in line 16 and thus have set $\text{DFS}(u)$ to the last visited node, which means that when we visit $u$ in an upcoming DFS, this DFS will directly jump to $\text{DFS}(u)$ after visiting $u$.

Note that by decreasing $\text{child}_{\text{close}}(u)$ to $-1$ we actually consume one more token than we had. However, for this we only need a constant amount of work, which can be accounted for by the processing time of $u$.

Now we consider nodes $u$ that are processed with $\text{score}_{\max}(u) > 0$ initially. If we ignore the update that sets $\text{score}_{\max}(u)$ to $\text{child}_{\text{close}}(u)$, everything seems to be simple: we consume one token and pass the rest to the parent (using the maximum instead of the sum) if a token is left. However, if we set $\text{score}_{\max}(u)$ to $\text{child}_{\text{close}}(u)$ we are getting new tokens out of nowhere. Fortunately it turns out that these tokens also stem from the $\text{score}_{\max}(c)$ of all children $c$ of $u$, only using the sum instead of the maximum: Note that $\text{child}_{\text{close}}(c) \le \text{score}_{\max}(c)$ for any node $c$ that has been processed, i.e. after this update $\text{child}_{\text{close}}(c) \le \text{score}_{\max}(c)$ holds, then both are changed by the same amount, and after that only $\text{child}_{\text{close}}(c)$ is decreased. As $\text{child}_{\text{close}}(u)$ is initially the sum of all positive $\text{child}_{\text{close}}(c)$ of the children $c$ of $u$, it follows that initially $\text{child}_{\text{close}}(u)$ is smaller than or equal to the sum of all $\text{score}_{\max}(c)$ of the children $c$ of $u$. Therefore each child $c$ has actually passed a part of the tokens of $\text{score}_{\max}(c)$ to $u$ in form of $\text{child}_{\text{close}}(u)$. Therefore this update does not create new tokens either.

This means that in total we only process $O(d)$ nodes and do amortized constant work per node, as we have claimed.
D Details of the Algorithm proposed by Nastos and Gao


Nastos and Gao [15] describe that in their greedy algorithm they test each possible edge addition and deletion (i.e. all $O(n^2)$ possibilities) in order to choose the edit that results in the largest improvement, i.e. the highest decrease of the number of induced $P_4$ and $C_4$. After executing this greedy heuristic they revert the last few edits and execute the bounded search tree algorithm. If this results in a solution with fewer edits, they repeat this last step until no improvement is possible anymore. We chose 10 for the number of edits that are reverted.

The main question for the implementation is thus how to select the next edit. As far as we know it is an open problem whether it is possible to determine the edit that destroys most $P_4$ and $C_4$ in time $o(n^2)$. Therefore we concentrate on the obvious approach that was also implied by Nastos and Gao: execute each possibility and see how the number of $P_4$ and $C_4$ changes. The main ingredient is thus a fast update algorithm for this counter.

As far as we are aware, the fastest update algorithm for counting node-induced $P_4$ and $C_4$ subgraphs needs amortized time $O(h^2)$ for each update, where $h$ is the h-index of the graph. While the worst-case bound of the h-index is $O(\sqrt{m})$, it has been shown that many real-world social networks have a much lower h-index [10]. However, this algorithm requires constant-time edge existence checks and stores many counts for pairs and triples of edges (though only if they are non-zero).

We implemented a different algorithm which has the same worst-case complexity
if we ignore the actual value of $h$: $O(m)$. Furthermore this algorithm is
much simpler to implement, and while it needs $O(n)$ additional memory during
updates, only the counter itself needs to be stored between updates. The
initial counting is thus possible in time $O(m^2)$, which results in an
$O(m^2 + k \cdot n^2 \cdot m)$ algorithm. Note that the time needed for the initial
counting is dominated by the time needed for each edit.
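
The dominance claim can be checked in one line. The derivation below reads the total running time as initial counting plus $k$ edits, each of which tests $O(n^2)$ candidate edits at $O(m)$ per test; it assumes a simple graph and $k \ge 1$.

```latex
m \le \binom{n}{2} < n^2
\;\Longrightarrow\;
m^2 = m \cdot m < n^2 \cdot m \le k \cdot n^2 \cdot m ,
\qquad\text{hence}\qquad
O(m^2 + k \cdot n^2 \cdot m) = O(k \cdot n^2 \cdot m).
```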

The main idea of the algorithm is that we examine the neighborhood structure
of the edge that shall be deleted or inserted. Using markers we note which
neighbors are common or exclusive to the two incident nodes. We iterate once
over each of these three groups of neighbors and over their neighbors, which
needs at most $O(m)$ time. Based on the status of the markers of these neighbors
of the neighbors we can count how many times certain structures occur in the
neighborhood of the edge. Using these counts we can determine how many $P_4$
and $C_4$ were destroyed and created by editing the edge.
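
The marker step alone can be sketched as follows. This covers only the classification of neighbors into the three groups, not the subsequent counting of structures; the adjacency-list representation with node IDs $0, \dots, n-1$, the bit encoding of the markers, and the function name are assumptions made for this example.

```python
def classify_neighbors(adj, u, v):
    """Split neighbors of u and v into common / exclusive-to-u / exclusive-to-v.

    adj is a list of neighbor lists indexed by node ID. The marker array is
    O(n) scratch memory that is only needed while the update is processed.
    """
    marker = [0] * len(adj)
    for x in adj[u]:
        if x != v:
            marker[x] |= 1  # bit 0: x is a neighbor of u
    for x in adj[v]:
        if x != u:
            marker[x] |= 2  # bit 1: x is a neighbor of v
    common = [x for x in adj[u] if marker[x] == 3]
    only_u = [x for x in adj[u] if marker[x] == 1]
    only_v = [x for x in adj[v] if marker[x] == 2]
    return common, only_u, only_v, marker
```

The function works whether or not the edge $\{u, v\}$ is currently present, so the same classification serves insertions and deletions. Returning the marker array lets a caller test the status of a neighbor's neighbor in constant time.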

Apart from applying the update algorithm $m$ times there is also a simpler
$O(m^2)$ counting algorithm which we used for the initialization. Here the idea
is again that we determine the common and exclusive neighborhoods for each
edge. Then we only need to iterate over the exclusive neighbors of one of the two
nodes and check for each of them how many of its neighbors are exclusive to the
other node. This gives us the number of $C_4$ that edge is part of. The product
of the sizes of the exclusive neighborhoods gives us the number of $P_4$ where the
edge is the central edge plus the number of $C_4$ the edge is part of. Combining
both we can get the number of $P_4$ where the edge is the central edge. While the
sum of these values already gives the number of $P_4$, the sum of the $C_4$-counts
still needs to be divided by 4. Note that when the graph is a quasi-threshold
graph, i.e., there are no $P_4$ and $C_4$, this needs only $O(m \cdot \Delta)$ time.
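
The counting scheme described above can be sketched as follows. It is a minimal illustration using Python sets instead of marker arrays; the function name and the representation of the graph as a list of neighbor sets indexed by node ID are assumptions made for this example.

```python
def count_p4_c4_per_edge(adj):
    """Return (#induced P4, #induced C4) using per-edge exclusive neighborhoods.

    adj is a list of sets; adj[u] holds the neighbors of node u.
    """
    p4 = 0
    c4_times_4 = 0
    for u in range(len(adj)):
        for v in adj[u]:
            if u < v:  # visit every undirected edge once
                only_u = adj[u] - adj[v] - {v}
                only_v = adj[v] - adj[u] - {u}
                # Edges between the two exclusive neighborhoods close a C4.
                c4_uv = sum(len(adj[a] & only_v) for a in only_u)
                # Remaining pairs form a P4 with {u, v} as the central edge.
                p4 += len(only_u) * len(only_v) - c4_uv
                c4_times_4 += c4_uv
    # Every C4 was counted once for each of its four edges.
    return p4, c4_times_4 // 4
```

For a path on four nodes the function returns one $P_4$ and no $C_4$, and for a cycle on four nodes it returns no $P_4$ and one $C_4$.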


E Generated Graphs 

Each connected component of the quasi-threshold graph was generated as the
reachability graph of a rooted tree. For generating a tree, 0 is the root and each
node $u \in \{1, \dots, n-1\}$ chooses a parent in $\{0, \dots, u-1\}$.
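
The construction of one component can be sketched as follows. The text does not state how the parent is chosen, so the uniform random choice below is an assumption, as are the function names and the representation of edges as (ancestor, descendant) pairs.

```python
import random

def random_tree_parents(n, rng):
    """Node 0 is the root; each node u >= 1 picks a parent in {0, ..., u-1}."""
    parent = [None] * n
    for u in range(1, n):
        parent[u] = rng.randrange(u)  # assumption: uniform choice
    return parent

def reachability_edges(parent):
    """Connect every node to all of its ancestors (closure of the tree)."""
    edges = set()
    for u in range(1, len(parent)):
        ancestor = parent[u]
        while ancestor is not None:
            edges.add((ancestor, u))
            ancestor = parent[ancestor]
    return edges
```

A call such as `reachability_edges(random_tree_parents(20, random.Random(1)))` yields the edge set of one quasi-threshold component with 20 nodes. The number of edges equals the sum of the node depths in the tree.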

As shown by [13], many real-world networks including social networks exhibit
a community size distribution that is similar to a power law distribution.
Therefore we chose a power law sequence with 10 as minimum, $0.2 \cdot n$ as maximum
and $-1$ as exponent for the component sizes and generated trees of the respective
sizes.

For $k$ edits we inserted $0.8 \cdot k$ new edges and deleted $0.2 \cdot k$ old edges of
the quasi-threshold graph, chosen uniformly at random. Therefore after these
modifications the maximum editing distance to the original graph is $k$. We used
more insertions than deletions as preliminary experiments on real-world networks
showed that during editing many more edges are deleted than inserted.
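
The perturbation step can be sketched as follows. The rounding of $0.8 \cdot k$, the explicit enumeration of all non-edges (only practical for small graphs), and the function name are assumptions made for this example; edges are stored as pairs $(u, v)$ with $u < v$.

```python
import random
from itertools import combinations

def perturb(n, edges, k, rng):
    """Insert about 0.8*k non-edges and delete the remaining share of old edges."""
    num_insert = round(0.8 * k)
    num_delete = k - num_insert
    present = sorted(edges)
    absent = [p for p in combinations(range(n), 2) if p not in edges]
    deleted = rng.sample(present, num_delete)   # only original edges are deleted
    inserted = rng.sample(absent, num_insert)   # only original non-edges are inserted
    return (edges - set(deleted)) | set(inserted)
```

Because deletions are drawn from the original edges and insertions from the original non-edges, no edit cancels another, and the perturbed graph differs from the original in exactly $k$ node pairs. The editing distance to the closest quasi-threshold graph can still be smaller than $k$.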

In Table 4 we show the results for all graphs that we have generated. The first
column shows the number of random edits we performed. As already mentioned
in the experimental evaluation, for all generated graphs the QTM algorithm finds
a quasi-threshold graph that is at least as close as the original one. Omitting
the initialization gives much worse results for low numbers of edits and slightly
worse results for higher numbers of edits. The lower bound is relatively close to
the generated and found number of edits for low numbers of edits; for very high
numbers of edits it is close to its theoretical maximum, $n/2$.

All in all this shows that the QTM algorithm finds edits that are reasonable 
but it depends on a good initial heuristic. 
Table 4: Results for the generated graphs